[SOUND].
So, as we explained the different text
representation tends to
enable different analysis.
In particular,
we can gradually add more and
more deeper analysis results
to represent text data.
And that would open up a more
interesting representation
opportunities and
also analysis capacities.
So, this table summarizes
what we have just seen.
So the first column shows
the text representation.
The second visualizes the generality
of such a representation.
Meaning whether we can do this
kind of representation accurately for
all the text data or only some of them.
And the third column shows
the enabled analysis techniques.
And the final column shows some
examples of application that
can be achieved through this
level of representation.
So let's take a look at them.
So as a stream text can only be processed
by stream processing algorithms.
It's very robust, it's general.
And there was still some interesting
applications that can be down
at this level.
For example, compression of text.
Doesn't necessarily need to
know the word boundaries.
Although knowing word boundaries
might actually also help.
Word base repetition is a very
important level of representation.
It's quite general and
relatively robust, indicating they
were a lot of analysis techniques.
Such as word relation analysis,
topic analysis and sentiment analysis.
And there are many applications that can
be enabled by this kind of analysis.
For example, thesaurus discovery has
to do with discovering related words.
And topic and
opinion related applications are abounded.
And there are, for example, people
might be interesting in knowing the major
topics covered in the collection of texts.
And this can be the case
in research literature.
And scientists want to know what are the
most important research topics today.
Or customer service people might want to
know all our major complaints from their
customers by mining their e-mail messages.
And business intelligence
people might be interested in
understanding consumers' opinions about
their products and the competitors'
products to figure out what are the
winning features of their products.
And, in general, there are many
applications that can be enabled by
the representation at this level.
Now, moving down, we'll see we can
gradually add additional representations.
By adding syntactical structures,
we can enable, of course,
syntactical graph analysis.
We can use graph mining algorithms
to analyze syntactic graphs.
And some applications are related
to this kind of representation.
For example,
stylistic analysis generally requires
syntactical structure representation.
We can also generate
the structure based features.
And those are features that might help us
classify the text objects into different
categories by looking at the structures
sometimes in the classification.
It can be more accurate.
For example,
if you want to classify articles into
different categories corresponding
to different authors.
You want to figure out which of
the k authors has actually written
this article, then you generally need
to look at the syntactic structures.
When we add entities and relations,
then we can enable other techniques
such as knowledge graph and
answers, or information network and
answers in general.
And this analysis enable
applications about entities.
For example,
discovery of all the knowledge and
opinions about real world entities.
You can also use this level representation
to integrate everything about
anything from scaled resources.
Finally, when we add logical predicates,
that would enable large inference,
of course.
And this can be very useful for
integrating analysis of
scattered knowledge.
For example,
we can also add ontology on top of the,
extracted the information from text,
to make inferences.
A good of example of application in this
enabled by this level of representation,
is a knowledge assistant for biologists.
And this program that can help a biologist
manage all the relevant knowledge from
literature about a research problem such
as understanding functions of genes.
And the computer can make inferences
about some of the hypothesis that
the biologist might be interesting.
For example,
whether a gene has a certain function, and
then the intelligent program can read the
literature to extract the relevant facts,
doing compiling and
information extracting.
And then using a logic system to
actually track that's the answers
to researchers questioning about what
genes are related to what functions.
So in order to support
this level of application
we need to go as far as
logical representation.
Now, this course is covering techniques
mainly based on word based representation.
And these techniques are general and
robust and that's more widely
used in various applications.
In fact, in virtually all the text mining
applications you need this level of
representation and then techniques that
support analysis of text in this level.
But obviously all these other
levels can be combined and
should be combined in order to support
the sophisticated applications.
So to summarize,
here are the major takeaway points.
Text representation determines what
kind of mining algorithms can be applied.
And there are multiple ways to
represent the text, strings, words,
syntactic structures, entity-relation
graphs, knowledge predicates, etc.
And these different
representations should in general
be combined in real applications
to the extent we can.
For example, even if we cannot
do accurate representations
of syntactic structures, we can state
that partial structures strictly.
And if we can recognize some entities,
that would be great.
So in general we want to
do as much as we can.
And when different levels
are combined together,
we can enable a richer analysis,
more powerful analysis.
This course however focuses
on word-based representation.
Such techniques have also several
advantage, first of they are general and
robust, so they are applicable
to any natural language.
That's a big advantage over
other approaches that rely on
more fragile natural language
processing techniques.
Secondly, it does not require
much manual effort, or
sometimes, it does not
require any manual effort.
So that's, again, an important benefit,
because that means that you can apply
it directly to any application.
Third, these techniques are actually
surprisingly powerful and
effective form in implications.
Although not all of course
as I just explained.
Now they are very effective
partly because the words
are invented by humans as basically
units for communications.
So they are actually quite sufficient for
representing all kinds of semantics.
So that makes this kind of word-based
representation all so powerful.
And finally, such a word-based
representation and the techniques enable
by such a representation can be combined
with many other sophisticated approaches.
So they're not competing with each other.
[MUSIC]

